Table of Contents

Description:  

The purpose of the case study is to classify a given silhouette as one of four different types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

Four "Corgie" model vehicles were used for the experiment: a double decker bus, Cheverolet van, Saab 9000 and an Opel Manta 400 cars. This particular combination of vehicles was chosen with the expectation that the bus, van and either one of the cars would be readily distinguishable, but it would be more difficult to distinguish between the cars.

The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

The points distribution for this case is as follows:

  • Data pre-processing - Understand the data and treat missing values (Use box plot), outliers (5 points)
  • Understanding the attributes - Find relationship between different attributes (Independent variables) and choose carefully which all attributes have to be a part of the analysis and why (5 points)
  • Use PCA from scikit learn and elbow plot to find out reduced number of dimension (which covers more than 95% of the variance) - 10 points
  • Use Support vector machines and use grid search (try C values - 0.01, 0.05, 0.5, 1 and kernel = linear, rbf) and find out the best hyper parameters and do cross validation to find the accuracy. (10 points)

Attribute Information:

ATTRIBUTES

COMPACTNESS (average perim)**2/area

CIRCULARITY (average radius)**2/area

DISTANCE CIRCULARITY area/(av.distance from border)**2

RADIUS RATIO (max.rad-min.rad)/av.radius

PR.AXIS ASPECT RATIO (minor axis)/(major axis)

MAX.LENGTH ASPECT RATIO (length perp. max length)/(max length)

SCATTER RATIO (inertia about minor axis)/(inertia about major axis)

ELONGATEDNESS area/(shrink width)**2

PR.AXIS RECTANGULARITY area/(pr.axis length*pr.axis width)

MAX.LENGTH RECTANGULARITY area/(max.length*length perp. to this)

SCALED VARIANCE (2nd order moment about minor axis)/area ALONG MAJOR AXIS

SCALED VARIANCE (2nd order moment about major axis)/area ALONG MINOR AXIS

SCALED RADIUS OF GYRATION (mavar+mivar)/area

SKEWNESS ABOUT (3rd order moment about major axis)/sigma_min**3 MAJOR AXIS

SKEWNESS ABOUT (3rd order moment about minor axis)/sigma_maj**3 MINOR AXIS

KURTOSIS ABOUT (4th order moment about major axis)/sigma_min**4 MINOR AXIS

KURTOSIS ABOUT (4th order moment about minor axis)/sigma_maj**4 MAJOR AXIS

HOLLOWS RATIO (area of hollows)/(area of bounding polygon)

Where sigma_maj2 is the variance along the major axis and sigma_min2 is the variance along the minor axis, and

area of hollows= area of bounding poly-area of object

The area of the bounding polygon is found as a side result of the computation to find the maximum length. Each individual length computation yields a pair of calipers to the object orientated at every 5 degrees. The object is propagated into an image containing the union of these calipers to obtain an image of the bounding polygon.

NUMBER OF CLASSES

CAR, BUS, VAN

Importing Packages and Reading file as DataFrame

In [1]:
import numpy as np #import numpy
import pandas as pd #import pandas
import seaborn as sns # import seaborn
import matplotlib.pyplot as plt #import pyplot
from scipy.stats import pearsonr #for pearson's correlation

from sklearn.model_selection import train_test_split #for splitting the data in train and test
from sklearn.preprocessing import StandardScaler,MinMaxScaler,RobustScaler #for various scaling methods
from sklearn.linear_model import LogisticRegression #for LogisticRegression
from sklearn.naive_bayes import GaussianNB #for NaiveBayes
from sklearn.neighbors import KNeighborsClassifier #for KNN
from sklearn.svm import SVC #for Support vector classifier


from sklearn.tree import DecisionTreeClassifier #for decision tree classification
#from sklearn.feature_extraction.text import CountVectorizer  #DT does not take strings as input for the model fit step....
from IPython.display import Image  #for image
from sklearn import tree #for tree
from os import system #using user environment
from sklearn.ensemble import BaggingClassifier #for bagging classifier
from sklearn.ensemble import AdaBoostClassifier #for adaptive boosting
from sklearn.ensemble import GradientBoostingClassifier #for gradient boosting
from sklearn.ensemble import RandomForestClassifier #for random forest
from sklearn.preprocessing import LabelEncoder #for lebel encoder
from scipy.stats import zscore #for zscore
from sklearn.decomposition import PCA #for PCA
from sklearn.model_selection import KFold,cross_val_score #for cross validation

from sklearn.tree import export_graphviz #for exporting dot data
from sklearn.externals.six import StringIO  #for stringIO
from IPython.display import Image  #for including image
import pydotplus #for dot data
import graphviz #for visualizing decision tree
from statistics import median,mean #for median and mean functions

from sklearn.metrics import accuracy_score,confusion_matrix,recall_score #for accuracy matrices
from sklearn.metrics import precision_score,classification_report,roc_auc_score,precision_score #for accuracy matrices
C:\Users\Ajay\Anaconda3\lib\site-packages\sklearn\externals\six.py:31: DeprecationWarning: The module is deprecated in version 0.21 and will be removed in version 0.23 since we've dropped support for Python 2.7. Please rely on the official version of six (https://pypi.org/project/six/).
  "(https://pypi.org/project/six/).", DeprecationWarning)
In [2]:
DataFrame = pd.read_csv('vehicle.csv',dtype={'class': 'category'}) #reading the CSV file
DataFrame.head(10) #to check head of the dataframe
Out[2]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus
5 107 NaN 106.0 172.0 50.0 6 255.0 26.0 28.0 169 280.0 957.0 264.0 85.0 5.0 9.0 181.0 183 bus
6 97 43.0 73.0 173.0 65.0 6 153.0 42.0 19.0 143 176.0 361.0 172.0 66.0 13.0 1.0 200.0 204 bus
7 90 43.0 66.0 157.0 65.0 9 137.0 48.0 18.0 146 162.0 281.0 164.0 67.0 3.0 3.0 193.0 202 van
8 86 34.0 62.0 140.0 61.0 7 122.0 54.0 17.0 127 141.0 223.0 112.0 64.0 2.0 14.0 200.0 208 van
9 93 44.0 98.0 NaN 62.0 11 183.0 36.0 22.0 146 202.0 505.0 152.0 64.0 4.0 14.0 195.0 204 car
  • Assigned 'Category' datatype to 'class' column while reading the file.
  • Also, 'class' column contains class data which can be the 'target' column in further stages
In [3]:
DataFrame.tail() #to check tail of the dataframe
Out[3]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
841 93 39.0 87.0 183.0 64.0 8 169.0 40.0 20.0 134 200.0 422.0 149.0 72.0 7.0 25.0 188.0 195 car
842 89 46.0 84.0 163.0 66.0 11 159.0 43.0 20.0 159 173.0 368.0 176.0 72.0 1.0 20.0 186.0 197 van
843 106 54.0 101.0 222.0 67.0 12 222.0 30.0 25.0 173 228.0 721.0 200.0 70.0 3.0 4.0 187.0 201 car
844 86 36.0 78.0 146.0 58.0 7 135.0 50.0 18.0 124 155.0 270.0 148.0 66.0 0.0 25.0 190.0 195 car
845 85 36.0 66.0 123.0 55.0 5 120.0 56.0 17.0 128 140.0 212.0 131.0 73.0 1.0 18.0 186.0 190 van

Exploratory Data Analysis

Shape of the data

In [4]:
print('\033[1m''Number of rows in dataframe',DataFrame.shape[0]) #for number of rows
print('\033[1m''Number of features in dataframe',DataFrame.shape[1]) #for number of features
Number of rows in dataframe 846
Number of features in dataframe 19
  • The dataset has 846 rows and 19 columns(features)

Data type of each attribute

In [5]:
DataFrame.dtypes.to_frame('Datatypes of attributes').T #for datatypes of attributes 
Out[5]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
Datatypes of attributes int64 float64 float64 float64 float64 int64 float64 float64 float64 int64 float64 float64 float64 float64 float64 float64 float64 int64 category
  • 4 features have int datatype and 14 have float datatype and 1 feature has category datatype

Checking the presence of missing values

In [6]:
DataFrame.isnull().sum().to_frame('Presence of missing values').T #for checking presence of missing values
Out[6]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
Presence of missing values 0 5 4 6 2 0 1 1 3 0 3 2 2 4 6 1 1 0 0
  • The dataset contains missing values in various dimensions. These will be trated in data preprocessing stage.

5 point summary of numerical attribute

In [7]:
DataFrame.describe().T #for 5 point summary
Out[7]:
count mean std min 25% 50% 75% max
compactness 846.0 93.678487 8.234474 73.0 87.00 93.0 100.0 119.0
circularity 841.0 44.828775 6.152172 33.0 40.00 44.0 49.0 59.0
distance_circularity 842.0 82.110451 15.778292 40.0 70.00 80.0 98.0 112.0
radius_ratio 840.0 168.888095 33.520198 104.0 141.00 167.0 195.0 333.0
pr.axis_aspect_ratio 844.0 61.678910 7.891463 47.0 57.00 61.0 65.0 138.0
max.length_aspect_ratio 846.0 8.567376 4.601217 2.0 7.00 8.0 10.0 55.0
scatter_ratio 845.0 168.901775 33.214848 112.0 147.00 157.0 198.0 265.0
elongatedness 845.0 40.933728 7.816186 26.0 33.00 43.0 46.0 61.0
pr.axis_rectangularity 843.0 20.582444 2.592933 17.0 19.00 20.0 23.0 29.0
max.length_rectangularity 846.0 147.998818 14.515652 118.0 137.00 146.0 159.0 188.0
scaled_variance 843.0 188.631079 31.411004 130.0 167.00 179.0 217.0 320.0
scaled_variance.1 844.0 439.494076 176.666903 184.0 318.00 363.5 587.0 1018.0
scaled_radius_of_gyration 844.0 174.709716 32.584808 109.0 149.00 173.5 198.0 268.0
scaled_radius_of_gyration.1 842.0 72.447743 7.486190 59.0 67.00 71.5 75.0 135.0
skewness_about 840.0 6.364286 4.920649 0.0 2.00 6.0 9.0 22.0
skewness_about.1 845.0 12.602367 8.936081 0.0 5.00 11.0 19.0 41.0
skewness_about.2 845.0 188.919527 6.155809 176.0 184.00 188.0 193.0 206.0
hollows_ratio 846.0 195.632388 7.438797 181.0 190.25 197.0 201.0 211.0

5 point summary understanding:

  • Outliers are present in 'radius_ratio','Pr.axis_aspect_ratio','max_length_aspect_ratio','scaled_radius_of_gyration.1','skewness_about.1' and'skewness_about' columns.
  • 'Pr.axis_aspect_ratio','max_length_aspect_ratio' and 'scaled_radius_of_gyration.1' are right skewed.
  • Missing values are Present in circularity 'distance_circularity', 'radius_ratio', 'pr.axis_aspect_ratio', 'scatter_ratio','elongatedness','pr.axis_rectangularity', 'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration','scaled_radius_of_gyration.1','skewness_about','skewness_about.1' and 'skewness_about.2' columns.
  • No negative values are present in the dataset

Distribution of numerical columns.

In [8]:
col_names=DataFrame.columns.values.tolist()
sns.set(context='notebook',style='whitegrid', palette='dark',font='sans-serif',font_scale=1.2,color_codes=True)    
fig, axes = plt.subplots(nrows=9, ncols=2)
count=0
for i in range (9):
    for j in range (2):
        col=col_names[count+j]            
        sns.distplot(DataFrame[col].values,ax=axes[i][j],bins=30,color="tab:cyan")
        axes[i][j].set_title(col,fontsize=17)
        fig=plt.gcf()
        fig.set_size_inches(8,20)
        plt.tight_layout()
    count=count+j+1
C:\Users\Ajay\Anaconda3\lib\site-packages\numpy\lib\histograms.py:824: RuntimeWarning: invalid value encountered in greater_equal
  keep = (tmp_a >= first_edge)
C:\Users\Ajay\Anaconda3\lib\site-packages\numpy\lib\histograms.py:825: RuntimeWarning: invalid value encountered in less_equal
  keep &= (tmp_a <= last_edge)
C:\Users\Ajay\Anaconda3\lib\site-packages\statsmodels\nonparametric\kde.py:447: RuntimeWarning: invalid value encountered in greater
  X = X[np.logical_and(X > clip[0], X < clip[1])] # won't work for two columns.
C:\Users\Ajay\Anaconda3\lib\site-packages\statsmodels\nonparametric\kde.py:447: RuntimeWarning: invalid value encountered in less
  X = X[np.logical_and(X > clip[0], X < clip[1])] # won't work for two columns.

Understanding from distributions

  • 'Pr.axis_aspect_ratio','max.length_aspect_ratio' and 'scaled_radius_of_gyration.1' are right skewed.
  • 'skewness_about.2', 'skewness_about.1', 'scaled_radius_of_gyration' and 'skewness_about.1' are somewhat normally distributed.

Note:

  • The Runtime warning is because of presense of missing values. The missing values are imputed further.

Distribution of Categorical(Target) column.

In [9]:
plot=sns.countplot(x=DataFrame['class'],data=DataFrame) #Countplot of 'class' 
In [10]:
DataFrame['class'].value_counts().to_frame('Target column distriution') #Value counts of Target column
Out[10]:
Target column distriution
car 429
bus 218
van 199

Understanding from distribution

  • The number of cars, buses and vans are 429, 218, 199 respectively.

Measure of skewness of numerical columns

In [11]:
DataFrame.skew().to_frame('Skewness measure').T #for measure of skewness
Out[11]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
Skewness measure 0.381271 0.261809 0.106585 0.394978 3.830362 6.778394 0.607271 0.047847 0.770889 0.256359 0.651598 0.842034 0.279317 2.083496 0.776519 0.688017 0.249321 -0.226341

Checking the presence of outliers

In [12]:
col_names=DataFrame.columns.values.tolist()#column names    
fig, axes = plt.subplots(nrows=9, ncols=2) #create subplots 9rows x 2columns
count=0
for i in range (9):
    for j in range (2):
        col=col_names[count+j]            
        sns.boxplot(DataFrame[col].values,ax=axes[i][j],color="tab:cyan")
        axes[i][j].set_title(col,fontsize=17)
        fig=plt.gcf()
        fig.set_size_inches(8,20)
        plt.tight_layout()
    count=count+j+1

Understanding from boxplots

  • Prove the existance of outliers in 'radius_ratio','Pr.axis_aspect_ratio','max_length_aspect_ratio','scaled_radius_of_gyration.1' and'skewness_about' columns.
  • 'Pr.axis_aspect_ratio','max.length_aspect_ratio', 'skewness_about' and 'scaled_radius_of_gyration.1' are right skewed.

Data Preprocessing:

Label Encoding of variables

In [13]:
df_copy = DataFrame.copy() #making a copy of dataframe for preprocessing

encoder = LabelEncoder() #creating object of LabelEncoder 
df_copy['class'] = encoder.fit_transform(df_copy['class']).astype(int) #encoding 'class' column 
df_copy.head() #displaying head of encoded dataframe
Out[13]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 2
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 2
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 1
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 2
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 0

Checking & modifying datatypes after Label Encoding

In [14]:
df_copy['class'].dtype #for datatype
Out[14]:
dtype('int32')
In [15]:
df_copy[['class']] = df_copy[['class']].apply(pd.Categorical)#changing datatype of attribute to categorical
In [16]:
df_copy['class'].dtype #for datatype
Out[16]:
CategoricalDtype(categories=[0, 1, 2], ordered=False)

Dealing with missing values.

Imputing missing values with mean

In [17]:
df_copy['circularity'].fillna((df_copy['circularity'].mean()), inplace=True) #Imputing with mean
df_copy['distance_circularity'].fillna((df_copy['distance_circularity'].mean()), inplace=True) #Imputing with mean
df_copy['radius_ratio'].fillna((df_copy['radius_ratio'].mean()), inplace=True) #Imputing with mean
df_copy['pr.axis_aspect_ratio'].fillna((df_copy['pr.axis_aspect_ratio'].mean()), inplace=True) #Imputing with mean
df_copy['scatter_ratio'].fillna((df_copy['scatter_ratio'].mean()), inplace=True) #Imputing with mean
df_copy['elongatedness'].fillna((df_copy['elongatedness'].mean()), inplace=True) #Imputing with mean
df_copy['pr.axis_rectangularity'].fillna((df_copy['pr.axis_rectangularity'].mean()), inplace=True) #Imputing with mean
df_copy['scaled_variance'].fillna((df_copy['scaled_variance'].mean()), inplace=True) #Imputing with mean
df_copy['scaled_variance.1'].fillna((df_copy['scaled_variance.1'].mean()), inplace=True) #Imputing with mean
df_copy['scaled_radius_of_gyration'].fillna((df_copy['scaled_radius_of_gyration'].mean()), inplace=True) #Imputing with mean
df_copy['scaled_radius_of_gyration.1'].fillna((df_copy['scaled_radius_of_gyration.1'].mean()),inplace=True)#Imputing with mean
df_copy['skewness_about'].fillna((df_copy['skewness_about'].mean()), inplace=True) #Imputing with mean
df_copy['skewness_about.1'].fillna((df_copy['skewness_about.1'].mean()), inplace=True) #Imputing with mean
df_copy['skewness_about.2'].fillna((df_copy['skewness_about.2'].mean()), inplace=True) #Imputing with mean
In [18]:
df_copy.isnull().sum().to_frame('Presence of missing values').T #for checking presence of missing values
Out[18]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
Presence of missing values 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
In [19]:
df_copy.head(10) #check head of dataframe                                                                               
Out[19]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.000000 83.0 178.000000 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 2
1 91 41.000000 84.0 141.000000 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 2
2 104 50.000000 106.0 209.000000 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 1
3 93 41.000000 82.0 159.000000 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 2
4 85 44.000000 70.0 205.000000 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 0
5 107 44.828775 106.0 172.000000 50.0 6 255.0 26.0 28.0 169 280.0 957.0 264.0 85.0 5.0 9.0 181.0 183 0
6 97 43.000000 73.0 173.000000 65.0 6 153.0 42.0 19.0 143 176.0 361.0 172.0 66.0 13.0 1.0 200.0 204 0
7 90 43.000000 66.0 157.000000 65.0 9 137.0 48.0 18.0 146 162.0 281.0 164.0 67.0 3.0 3.0 193.0 202 2
8 86 34.000000 62.0 140.000000 61.0 7 122.0 54.0 17.0 127 141.0 223.0 112.0 64.0 2.0 14.0 200.0 208 2
9 93 44.000000 98.0 168.888095 62.0 11 183.0 36.0 22.0 146 202.0 505.0 152.0 64.0 4.0 14.0 195.0 204 1

Handeling Outliers with mean replacement

In [20]:
#meanradius_ratio = float(df_copy['radius_ratio'].mean()) #radius_ratio
#df_copy['radius_ratio'] = np.where(df_copy['radius_ratio'] >np.percentile(df_copy['radius_ratio'], 75), meanradius_ratio,df_copy['radius_ratio']) #replacing with mean

meanpraxis_aspect_ratio = float(df_copy['pr.axis_aspect_ratio'].mean()) #mean pr.axis_aspect_ratio
df_copy['pr.axis_aspect_ratio'] = np.where(df_copy['pr.axis_aspect_ratio'] >np.percentile(df_copy['pr.axis_aspect_ratio'], 75), meanpraxis_aspect_ratio,df_copy['pr.axis_aspect_ratio'])#replacing with mean

meanmaxlength_aspect_ratio = float(df_copy['max.length_aspect_ratio'].mean()) #mean max.length_aspect_ratio
df_copy['max.length_aspect_ratio'] = np.where(df_copy['max.length_aspect_ratio'] >np.percentile(df_copy['max.length_aspect_ratio'], 75), meanmaxlength_aspect_ratio,df_copy['max.length_aspect_ratio'])#replacing with mean

meanscaled_radius_of_gyration = float(df_copy['scaled_radius_of_gyration.1'].mean()) #mean scaled_radius_of_gyration.1
df_copy['scaled_radius_of_gyration.1'] = np.where(df_copy['scaled_radius_of_gyration.1'] >np.percentile(df_copy['scaled_radius_of_gyration.1'], 75), meanscaled_radius_of_gyration,df_copy['scaled_radius_of_gyration.1'])#replacing with mean

meanskewness_about = float(df_copy['skewness_about'].mean()) #mean skewness_about
df_copy['skewness_about'] = np.where(df_copy['skewness_about'] >np.percentile(df_copy['skewness_about'], 75),meanskewness_about ,df_copy['skewness_about'])#replacing with mean

#Boxplots after handeling outliers
col_names=df_copy.columns.values.tolist() #column names    
fig, axes = plt.subplots(nrows=9, ncols=2) #create subplots 9rows x 2columns
count=0
for i in range (9):
    for j in range (2):
        col=col_names[count+j]            
        sns.boxplot(df_copy[col].values,ax=axes[i][j],color="tab:cyan")
        axes[i][j].set_title(col,fontsize=17)
        fig=plt.gcf()
        fig.set_size_inches(8,20)
        plt.tight_layout()
    count=count+j+1

Understanding from boxplots after handeling outliers

  • After handeling Outliers in 'Pr.axis_aspect_ratio' and 'max_length_aspect_ratio' columns with mean replacement some outliers are created on the lower side, but the number of outliers is reduced.
  • After handeling Outliers in 'scaled_radius_of_gyration.1' and 'skewness_about' columns with mean replacement columns some outliers are created on the lower side , but the number of outliers is reduced.

Note:

  • 'radius_ratio' column is not handled by mean/median replacement because it was creating more outliers than before.

Pairplot of all features

In [21]:
sns.pairplot(df_copy)# pairplot of all features
Out[21]:
<seaborn.axisgrid.PairGrid at 0x23c0c988710>

Understanding from pairplot

  • Positive and negative correlation are present in multiple features.
  • The correlation is on the higher side for 'compactness', 'circularity', 'distance_circularity', 'radius_ratio', 'scatter_ratio', 'pr.axis_rectangularity', 'max.length_rectangularity', 'scaled_variance', 'scaled_variance.1','scaled_radius_of_gyration', 'skewness_about.2', 'pr.axis_aspect_ratio' columns

Corr plot of all features

In [22]:
plt.figure(figsize=(15,10)) #for adjusting figuresize
sns.heatmap(df_copy.corr(),annot=True) #for correlation plot
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x23c17cdb9b0>

Understanding from above corr plot:

  • Our objective is to predict the class of vehice, if the features are highly correlated the results will not be accurate.
  • For reducing the multicolinearity some highly correlated features should be removed.

  • 'compactness', 'circularity', 'distance_circularity', 'radius_ratio', 'scatter_ratio', 'pr.axis_rectangularity', 'max.length_rectangularity', 'scaled_variance', 'scaled_variance.1','scaled_radius_of_gyration', 'skewness_about.2', 'pr.axis_aspect_ratio', these columns are highly correlated(more than 0.50) to each other.

  • 'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1', 'hollows_ratio', 'max.length_aspect_ratio' the correlation between these columns is low to moderate(between -0.25 to 0.37).

Creating a copy of dataframe to be used in PCA

In [23]:
pca_df=df_copy.copy() #Copy of preprocessed dataframe to be used in PCA
pca_df.head() #Head of dataframe
Out[23]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 61.67891 10.000000 162.0 42.0 20.0 159 176.0 379.0 184.0 70.000000 6.000000 16.0 187.0 197 2
1 91 41.0 84.0 141.0 57.00000 9.000000 149.0 45.0 19.0 143 170.0 330.0 158.0 72.000000 9.000000 14.0 189.0 199 2
2 104 50.0 106.0 209.0 61.67891 10.000000 207.0 32.0 23.0 158 223.0 635.0 220.0 73.000000 6.364286 9.0 188.0 196 1
3 93 41.0 82.0 159.0 63.00000 9.000000 144.0 46.0 19.0 143 160.0 309.0 127.0 63.000000 6.000000 10.0 199.0 207 2
4 85 44.0 70.0 205.0 61.67891 8.567376 149.0 45.0 19.0 144 241.0 325.0 188.0 72.447743 9.000000 11.0 180.0 183 0

WITHOUT PCA:

Removing features

  • Removing features which have higher correlation >= 50%.
In [24]:
df_copy = df_copy.drop(['scaled_radius_of_gyration' ,'skewness_about.2','radius_ratio','distance_circularity', 'circularity', 'scatter_ratio','scaled_variance.1','pr.axis_rectangularity', 'max.length_rectangularity','scaled_variance'],axis=1) #Dropping 
df_copy.head() #Head of updated dataframe
Out[24]:
compactness pr.axis_aspect_ratio max.length_aspect_ratio elongatedness scaled_radius_of_gyration.1 skewness_about skewness_about.1 hollows_ratio class
0 95 61.67891 10.000000 42.0 70.000000 6.000000 16.0 197 2
1 91 57.00000 9.000000 45.0 72.000000 9.000000 14.0 199 2
2 104 61.67891 10.000000 32.0 73.000000 6.364286 9.0 196 1
3 93 63.00000 9.000000 46.0 63.000000 6.000000 10.0 207 2
4 85 61.67891 8.567376 45.0 72.447743 9.000000 11.0 183 0

Pairplot after removing features with higher correlation

In [25]:
sns.pairplot(df_copy) #Pairplot of features
Out[25]:
<seaborn.axisgrid.PairGrid at 0x23c1823dac8>

Scaling of columns

In [26]:
X = df_copy.drop('class',axis=1) #independent dimensions  
y = df_copy['class'] #selecting target column
X = X.apply(zscore) #Scaling with zscore

Train Test Split (70:30)

In [27]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.30,random_state=1) #train test split in 70:30 ratio

Models:

Naive Bayes

In [28]:
NB = GaussianNB() #Instantiate the Gaussian Naive bayes 
NB.fit(X_train,y_train) #Call the fit method of NB to train the model or to learn the parameters of model
y_predi = NB.predict(X_test) #Predict 
In [29]:
print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test,y_predi)) #for confusion matrix
print('-'*30)
NB_accuracy = accuracy_score(y_test,y_predi)
print('Accuracy of Naive Bayes :{:.2f}'.format(NB_accuracy)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test,y_predi)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[46  9  4]
 [25 95 13]
 [ 9 12 41]]
------------------------------
Accuracy of Naive Bayes :0.72
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.57      0.78      0.66        59
           1       0.82      0.71      0.76       133
           2       0.71      0.66      0.68        62

    accuracy                           0.72       254
   macro avg       0.70      0.72      0.70       254
weighted avg       0.73      0.72      0.72       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->

Naive bayes with cross validation

In [30]:
scores = cross_val_score(NB, X, y, cv=9, scoring='accuracy')#Evaluate a score by cross-validation
max_NB_cross_nopca=scores.max()#selecting highest score
print(scores)#print Scores
[0.72916667 0.70526316 0.81914894 0.64893617 0.79787234 0.80851064
 0.62365591 0.72043011 0.78494624]

SVC with C=0.01,0.05,0.5,1 and kernel=rbf

In [31]:
svc = SVC(0.01,kernel ='rbf',gamma='auto')  #Instantiate SVC
svc.fit(X_train,y_train) #Call the fit method of SVC to train the model or to learn the parameters of model
predicted_svc = svc.predict(X_test) #Predict 

print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test,predicted_svc)) #for confusion matrix
print('-'*30)
SVC_accuracy = accuracy_score(y_test,predicted_svc) #for accuracy score
print('Accuracy of SVC :',SVC_accuracy)
print('-'*30)
print('\n Classification Report\n',classification_report(y_test,predicted_svc)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[  0  59   0]
 [  0 133   0]
 [  0  62   0]]
------------------------------
Accuracy of SVC : 0.5236220472440944
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.00      0.00      0.00        59
           1       0.52      1.00      0.69       133
           2       0.00      0.00      0.00        62

    accuracy                           0.52       254
   macro avg       0.17      0.33      0.23       254
weighted avg       0.27      0.52      0.36       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
C:\Users\Ajay\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
In [32]:
svc = SVC(0.05,kernel ='rbf',gamma='auto')  #Instantiate SVC
svc.fit(X_train,y_train) #Call the fit method of SVC to train the model or to learn the parameters of model
predicted_svc = svc.predict(X_test) #Predict 

print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test,predicted_svc)) #for confusion matrix
print('-'*30)
SVC_accuracy = accuracy_score(y_test,predicted_svc) #for accuracy score
print('Accuracy of SVC :',SVC_accuracy)
print('-'*30)
print('\n Classification Report\n',classification_report(y_test,predicted_svc)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[ 23  36   0]
 [  6 127   0]
 [  1  61   0]]
------------------------------
Accuracy of SVC : 0.5905511811023622
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.77      0.39      0.52        59
           1       0.57      0.95      0.71       133
           2       0.00      0.00      0.00        62

    accuracy                           0.59       254
   macro avg       0.44      0.45      0.41       254
weighted avg       0.47      0.59      0.49       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
C:\Users\Ajay\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
In [33]:
svc = SVC(0.5,kernel ='rbf',gamma='auto')  #Instantiate SVC
svc.fit(X_train,y_train) #Call the fit method of SVC to train the model or to learn the parameters of model
predicted_svc = svc.predict(X_test) #Predict 

print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test,predicted_svc)) #for confusion matrix
print('-'*30)
SVC_accuracy = accuracy_score(y_test,predicted_svc) #for accuracy score
print('Accuracy of SVC :',SVC_accuracy)
print('-'*30)
print('\n Classification Report\n',classification_report(y_test,predicted_svc)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[ 58   1   0]
 [ 11 115   7]
 [  3   4  55]]
------------------------------
Accuracy of SVC : 0.8976377952755905
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.81      0.98      0.89        59
           1       0.96      0.86      0.91       133
           2       0.89      0.89      0.89        62

    accuracy                           0.90       254
   macro avg       0.88      0.91      0.89       254
weighted avg       0.91      0.90      0.90       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
In [34]:
svc = SVC(1,kernel ='rbf',gamma='auto')  #Instantiate SVC
svc.fit(X_train,y_train) #Call the fit method of SVC to train the model or to learn the parameters of model
predicted_svc = svc.predict(X_test) #Predict 

print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test,predicted_svc)) #for confusion matrix
print('-'*30)
SVC_accuracy1 = accuracy_score(y_test,predicted_svc) #for accuracy score
print('Accuracy of SVC :',SVC_accuracy1)
print('-'*30)
print('\n Classification Report\n',classification_report(y_test,predicted_svc)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[ 59   0   0]
 [  9 117   7]
 [  1   3  58]]
------------------------------
Accuracy of SVC : 0.9212598425196851
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.86      1.00      0.92        59
           1       0.97      0.88      0.92       133
           2       0.89      0.94      0.91        62

    accuracy                           0.92       254
   macro avg       0.91      0.94      0.92       254
weighted avg       0.93      0.92      0.92       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->

SVC with cross validation

In [35]:
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_svc_cross_nopca=scores.max()#selecting highest score
print(scores)#print Scores
[0.92941176 0.91764706 0.91764706 0.89411765 0.94117647 0.92941176
 0.91764706 0.92941176 0.97619048 0.92682927]

SVC with C=0.01,0.05,0.5,1 and kernel=Linear

In [36]:
svc = SVC(0.01,kernel = 'linear')  #Instantiate SVC
svc.fit(X_train,y_train) #Call the fit method of SVC to train the model or to learn the parameters of model
predicted_svc = svc.predict(X_test) #Predict 

print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test,predicted_svc)) #for confusion matrix
print('-'*30)
SVC_accuracy = accuracy_score(y_test,predicted_svc) #for accuracy score
print('Accuracy of SVC :',SVC_accuracy)
print('-'*30)
print('\n Classification Report\n',classification_report(y_test,predicted_svc)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[ 41  18   0]
 [ 18 112   3]
 [  7  14  41]]
------------------------------
Accuracy of SVC : 0.7637795275590551
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.62      0.69      0.66        59
           1       0.78      0.84      0.81       133
           2       0.93      0.66      0.77        62

    accuracy                           0.76       254
   macro avg       0.78      0.73      0.75       254
weighted avg       0.78      0.76      0.76       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
In [37]:
svc = SVC(0.05,kernel = 'linear')  #Instantiate SVC
svc.fit(X_train,y_train) #Call the fit method of SVC to train the model or to learn the parameters of model
predicted_svc = svc.predict(X_test) #Predict 

print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test,predicted_svc)) #for confusion matrix
print('-'*30)
SVC_accuracy = accuracy_score(y_test,predicted_svc) #for accuracy score
print('Accuracy of SVC :',SVC_accuracy)
print('-'*30)
print('\n Classification Report\n',classification_report(y_test,predicted_svc)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[ 45  14   0]
 [ 17 109   7]
 [  1   5  56]]
------------------------------
Accuracy of SVC : 0.8267716535433071
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.71      0.76      0.74        59
           1       0.85      0.82      0.84       133
           2       0.89      0.90      0.90        62

    accuracy                           0.83       254
   macro avg       0.82      0.83      0.82       254
weighted avg       0.83      0.83      0.83       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
In [38]:
svc = SVC(0.5,kernel = 'linear')  #Instantiate SVC
svc.fit(X_train,y_train) #Call the fit method of SVC to train the model or to learn the parameters of model
predicted_svc = svc.predict(X_test) #Predict 

print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test,predicted_svc)) #for confusion matrix
print('-'*30)
SVC_accuracy = accuracy_score(y_test,predicted_svc) #for accuracy score
print('Accuracy of SVC :',SVC_accuracy)
print('-'*30)
print('\n Classification Report\n',classification_report(y_test,predicted_svc)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[ 47  12   0]
 [ 16 111   6]
 [  0   5  57]]
------------------------------
Accuracy of SVC : 0.8464566929133859
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.75      0.80      0.77        59
           1       0.87      0.83      0.85       133
           2       0.90      0.92      0.91        62

    accuracy                           0.85       254
   macro avg       0.84      0.85      0.84       254
weighted avg       0.85      0.85      0.85       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
In [39]:
svc = SVC(1,kernel = 'linear')  #Instantiate SVC
svc.fit(X_train,y_train) #Call the fit method of SVC to train the model or to learn the parameters of model
predicted_svc = svc.predict(X_test) #Predict 

print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test,predicted_svc)) #for confusion matrix
print('-'*30)
SVC_accuracy = accuracy_score(y_test,predicted_svc) #for accuracy score
print('Accuracy of SVC :',SVC_accuracy)
print('-'*30)
print('\n Classification Report\n',classification_report(y_test,predicted_svc)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[ 46  13   0]
 [ 17 110   6]
 [  0   5  57]]
------------------------------
Accuracy of SVC : 0.8385826771653543
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.73      0.78      0.75        59
           1       0.86      0.83      0.84       133
           2       0.90      0.92      0.91        62

    accuracy                           0.84       254
   macro avg       0.83      0.84      0.84       254
weighted avg       0.84      0.84      0.84       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->

SVC with cross validation

In [40]:
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_svc_cross_nopca=scores.max()#selecting highest score
print(scores)#print Scores
[0.83529412 0.78823529 0.78823529 0.8        0.85882353 0.78823529
 0.8        0.8        0.83333333 0.75609756]

K-Nearest Neighbor

In [41]:
knn = KNeighborsClassifier(n_neighbors = 3) #Instantiate KNN with k=3
knn.fit(X_train,y_train) #Call the fit method of KNN to train the model or to learn the parameters of model
y_predict = knn.predict(X_test) #Predict 

print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test,y_predi)) #for confusion matrix
print('-'*30)
KNN_accuracy = accuracy_score(y_test,y_predict)
print('Accuracy of KNN :{:.2f}'.format(KNN_accuracy)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test,y_predi)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[46  9  4]
 [25 95 13]
 [ 9 12 41]]
------------------------------
Accuracy of KNN :0.88
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.57      0.78      0.66        59
           1       0.82      0.71      0.76       133
           2       0.71      0.66      0.68        62

    accuracy                           0.72       254
   macro avg       0.70      0.72      0.70       254
weighted avg       0.73      0.72      0.72       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->

KNN with cross validation

In [42]:
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_knn_cross_nopca=scores.max()#selecting highest score
print(scores)#print Scores
[0.85882353 0.89411765 0.85882353 0.82352941 0.91764706 0.89411765
 0.82352941 0.85882353 0.91666667 0.85365854]

Decision Tree With Regulariztion

In [43]:
dTR = DecisionTreeClassifier(criterion = 'gini', max_depth = 3, random_state=1) #Instantiate Decision Tree with max_depth
dTR.fit(X_train, y_train) #Call the fit method of DT to train the model or to learn the parameters of model
predicted_DTR = dTR.predict(X_test) #Predict

print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test,predicted_DTR)) #for confusion matrix
print('-'*30)
DTR_accuracy = accuracy_score(y_test,predicted_DTR)
print('Accuracy of Decision Tree with Regularization:{:.2f}'.format(DTR_accuracy)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test,predicted_DTR)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[53  6  0]
 [26 86 21]
 [ 1  0 61]]
------------------------------
Accuracy of Decision Tree with Regularization:0.79
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.66      0.90      0.76        59
           1       0.93      0.65      0.76       133
           2       0.74      0.98      0.85        62

    accuracy                           0.79       254
   macro avg       0.78      0.84      0.79       254
weighted avg       0.82      0.79      0.78       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->

Decision tree with cross validation

In [44]:
scores = cross_val_score(dTR, X, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_DTR_cross_nopca=scores.max()#selecting highest score
print(scores)#print Scores
[0.8        0.70588235 0.76470588 0.81176471 0.82352941 0.83529412
 0.81176471 0.75294118 0.85714286 0.81707317]

Bagging

In [45]:
bagg = BaggingClassifier(base_estimator=dTR, n_estimators=500,random_state=1) #Instantiate Bagging Classifier
bagg = bagg.fit(X_train, y_train) #Call the fit method of Bagging classifier to train the model or to learn the parameters of model
predicted_BAG = bagg.predict(X_test) #Predict


print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test,predicted_BAG)) #for confusion matrix
print('-'*30)
BAG_accuracy = accuracy_score(y_test,predicted_BAG)
print('Accuracy of Decision Tree :{:.2f}'.format(BAG_accuracy)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test,predicted_BAG)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[57  2  0]
 [28 88 17]
 [ 1  1 60]]
------------------------------
Accuracy of Decision Tree :0.81
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.66      0.97      0.79        59
           1       0.97      0.66      0.79       133
           2       0.78      0.97      0.86        62

    accuracy                           0.81       254
   macro avg       0.80      0.87      0.81       254
weighted avg       0.85      0.81      0.80       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->

Bagging with cross validation

In [46]:
scores = cross_val_score(bagg, X, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_bagg_cross_nopca=scores.max() #selecting highest score
print(scores)#print Scores
[0.83529412 0.71764706 0.78823529 0.83529412 0.82352941 0.82352941
 0.83529412 0.77647059 0.85714286 0.82926829]

Adaptive Boosting

In [47]:
Aboost = AdaBoostClassifier(n_estimators=50, random_state=1) #Instantiate Adaptive boosting Classifier
Aboost = Aboost.fit(X_train, y_train) #Call the fit method of Adaptive boosting Classifier to train the model or to learn the parameters of model
predicted_ADA = Aboost.predict(X_test) #Predict

print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test,predicted_ADA)) #for confusion matrix
print('-'*30)
ADA_accuracy = accuracy_score(y_test,predicted_ADA)
print('Accuracy of KNN :{:.2f}'.format(ADA_accuracy)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test,predicted_ADA)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[ 30  29   0]
 [  4 122   7]
 [  0  18  44]]
------------------------------
Accuracy of KNN :0.77
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.88      0.51      0.65        59
           1       0.72      0.92      0.81       133
           2       0.86      0.71      0.78        62

    accuracy                           0.77       254
   macro avg       0.82      0.71      0.74       254
weighted avg       0.79      0.77      0.76       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->

Adaptive boosting with cross validation

In [48]:
scores = cross_val_score(Aboost, X, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_Aboost_cross_nopca=scores.max() #selecting highest score
print(scores)#print Scores
[0.78823529 0.76470588 0.77647059 0.77647059 0.75294118 0.64705882
 0.71764706 0.70588235 0.71428571 0.80487805]

Gradient Boosting

In [49]:
Gboost = GradientBoostingClassifier(n_estimators = 100,random_state=1) #Instantiate Gradient boosting Classifier
Gboost = Gboost.fit(X_train, y_train)#Call the fit method of Gradient boosting Classifier to train the model or to learn the parameters of model
predicted_GRAD = Gboost.predict(X_test) #Predict


print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test,predicted_GRAD)) #for confusion matrix
print('-'*30)
GRAD_accuracy = accuracy_score(y_test,predicted_GRAD)
print('Accuracy of KNN :{:.2f}'.format(GRAD_accuracy)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test,predicted_GRAD)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[ 58   1   0]
 [  4 119  10]
 [  0   5  57]]
------------------------------
Accuracy of KNN :0.92
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.94      0.98      0.96        59
           1       0.95      0.89      0.92       133
           2       0.85      0.92      0.88        62

    accuracy                           0.92       254
   macro avg       0.91      0.93      0.92       254
weighted avg       0.92      0.92      0.92       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->

Gradient Boosting with cross validation

In [50]:
scores = cross_val_score(Gboost, X, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_Gboost_cross_nopca=scores.max()#selecting highest score
print(scores)#print Scores
[0.91764706 0.95294118 0.94117647 0.97647059 1.         0.89411765
 0.90588235 0.88235294 0.97619048 0.95121951]

Random Forest

In [51]:
#n=100
Rforest = RandomForestClassifier(n_estimators = 100, random_state=1, max_features=3)#Instantiate Random Forest Classifier
Rforest = Rforest.fit(X_train, y_train) #Call the fit method of Random Forest Classifier to train the model or to learn the parameters of model
predicted_RAN = Rforest.predict(X_test) #Predict

print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test,predicted_RAN )) #for confusion matrix
print('-'*30)
RAN_accuracy = accuracy_score(y_test,predicted_RAN )
print('Accuracy of KNN :{:.2f}'.format(RAN_accuracy)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test,predicted_RAN )) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[ 57   2   0]
 [  4 121   8]
 [  0   3  59]]
------------------------------
Accuracy of KNN :0.93
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.93      0.97      0.95        59
           1       0.96      0.91      0.93       133
           2       0.88      0.95      0.91        62

    accuracy                           0.93       254
   macro avg       0.93      0.94      0.93       254
weighted avg       0.93      0.93      0.93       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->

Random Forest with cross validation

In [52]:
scores = cross_val_score(Rforest, X, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_Rforest_cross_nopca=scores.max()#selecting highest score
print(scores)#print Scores
[0.94117647 0.92941176 0.94117647 0.92941176 0.96470588 0.92941176
 0.89411765 0.92941176 0.96428571 0.92682927]

WITH PCA:

In [53]:
X = pca_df.drop('class',axis=1) #independent dimensions  
y = pca_df['class'] #selecting target column
Xscaled = X.apply(zscore) #Scaling with zscore
In [54]:
Xscaled.head() #head of scaled dataframe
Out[54]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
0 0.160580 0.517302 0.056545 0.272965 0.571005 1.485660 -0.208038 0.136580 -0.225160 0.758332 -0.403077 -0.343028 0.285618 -0.004412 0.508072 0.380665 -0.312193 0.183957
1 -0.325470 -0.624564 0.120112 -0.835442 -0.679645 0.847718 -0.599893 0.520853 -0.611739 -0.344578 -0.594546 -0.620879 -0.513719 0.553906 1.649716 0.156589 0.013088 0.452977
2 1.254193 0.843549 1.518571 1.201630 0.571005 1.485660 1.148382 -1.144331 0.934576 0.689401 1.096764 1.108603 1.392391 0.833064 0.646700 -0.403603 -0.149552 0.049447
3 -0.082445 -0.624564 -0.007021 -0.296217 0.924126 0.847718 -0.750606 0.648945 -0.611739 -0.344578 -0.913661 -0.739958 -1.466773 -1.958521 0.508072 -0.291565 1.639494 1.529056
4 -1.054545 -0.135193 -0.769817 1.081803 0.571005 0.571729 -0.599893 0.520853 -0.611739 -0.275646 1.671171 -0.649231 0.408593 0.678897 1.649716 -0.179527 -1.450677 -1.699181

PCA

In [55]:
pca = PCA() #PCA
pca.fit(Xscaled) #fit Scaled data into PCA
Out[55]:
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

Eigen Values

In [56]:
print(pca.explained_variance_) #Eigen Values
[9.54343830e+00 3.27537324e+00 1.17891706e+00 1.06444499e+00
 8.88603304e-01 7.68927922e-01 3.50334013e-01 2.66296631e-01
 1.99311130e-01 1.62628696e-01 9.60760158e-02 8.00809954e-02
 4.95717419e-02 3.60430041e-02 2.84105211e-02 2.03773743e-02
 9.40399613e-03 3.06283735e-03]

Eigen vectors

In [57]:
print(pca.components_) #Eigen vectors
[[ 2.73386365e-01  2.91677852e-01  3.04976365e-01  2.62684432e-01
   8.64666936e-02  1.56313715e-01  3.14630653e-01 -3.12637226e-01
   3.11592695e-01  2.81788574e-01  3.04115120e-01  3.11435896e-01
   2.69221778e-01  1.47893193e-02  2.08384441e-02  5.84361644e-02
   3.14603835e-02  7.75517866e-02]
 [ 1.01000418e-01 -1.24872126e-01  5.96014451e-02  1.93970977e-01
   2.92208467e-01  1.61633682e-01 -5.92407339e-02 -9.59263362e-04
  -7.17154884e-02 -1.14308322e-01 -5.81930156e-02 -6.60009356e-02
  -2.02756042e-01 -4.67185508e-01 -4.61531581e-03  1.03686372e-01
   5.07729078e-01  5.13927844e-01]
 [ 5.73312564e-02 -2.04435190e-01  7.96202949e-02 -5.60094911e-02
  -3.59662750e-01 -1.46991610e-01  1.15223261e-01 -6.04740839e-02
   1.24466774e-01 -1.96196144e-01  1.08848164e-01  1.21427784e-01
  -2.10477181e-01 -5.19601260e-02 -2.45023617e-01  7.66251988e-01
  -4.69567791e-02 -2.41646430e-02]
 [-1.87308525e-01 -1.78350606e-05  2.30886326e-02  1.35960162e-01
   3.41860046e-01  1.90441885e-01 -2.31606704e-02 -2.85261722e-02
  -3.80110322e-02  7.00887714e-03  1.73504900e-02 -4.52529374e-02
  -3.93150153e-02  1.73373732e-01 -8.58903801e-01 -4.93420174e-02
  -1.24972942e-01 -1.39693143e-02]
 [ 5.34911273e-02 -1.11354428e-01 -1.35875332e-01  2.51111825e-01
   2.84777304e-01 -7.84443718e-01  8.67925178e-02 -1.06719821e-01
   7.35386595e-02 -2.41050208e-01  1.78703215e-01  1.23723902e-01
  -1.57225088e-02  6.57715025e-02 -4.30878648e-02 -1.64767241e-01
   1.74152099e-01 -1.14826327e-01]
 [ 2.18691189e-01  3.98919492e-02 -3.28839016e-02 -1.98736484e-01
  -5.77514362e-01 -1.25321256e-01  5.36891269e-02 -1.36941513e-02
   6.84846621e-02  6.81943396e-02  7.73065287e-03  8.21720016e-02
   3.96852866e-03 -3.34620594e-01 -4.13877347e-01 -4.69583547e-01
   1.84135806e-01  6.21166504e-02]
 [ 1.91725643e-01 -4.11041320e-01  1.49526940e-01  1.51726135e-01
  -7.90921311e-02  3.74801577e-01  9.28452054e-02 -1.42032200e-01
   7.66548720e-02 -4.02085554e-01  1.90004552e-01  8.29569322e-02
  -4.03546076e-01  1.93444518e-01  1.24768909e-01 -3.42159598e-01
  -8.43335696e-02 -1.51085524e-01]
 [ 5.16772622e-01  4.67981008e-02 -8.85658867e-02 -7.95559923e-02
  -4.89147400e-02 -5.28393921e-02 -8.81278889e-02  8.96397585e-02
  -6.59898603e-02  1.82066267e-01 -1.59916812e-01 -7.19156792e-02
  -1.44855109e-01  6.74053597e-01 -6.21447954e-02  6.74561664e-02
   2.92418612e-01  2.41031557e-01]
 [-5.65522333e-01  1.15704972e-02  1.52657263e-02  3.32628303e-01
  -4.21177535e-01  3.56223509e-02 -6.16348025e-02 -1.26939037e-01
  -1.32960632e-01 -3.04324002e-02  2.48731975e-01 -7.98795529e-02
   1.42061028e-01  3.14415073e-01  4.26597009e-02 -3.06986529e-02
   3.44154681e-01  2.08586370e-01]
 [ 3.53951135e-01  6.53218838e-02 -2.13046299e-01  6.36995373e-01
  -1.73939595e-01  5.50104966e-02 -2.12944946e-01  3.22477408e-01
  -2.00184250e-01  6.64878756e-02  1.99128114e-01 -2.04466697e-01
   7.97232629e-02 -1.68191491e-01 -3.28001378e-02  8.26495911e-02
  -1.81307499e-01 -1.87896384e-01]
 [-2.25999300e-01  1.30141423e-01  2.46132140e-02  2.34409938e-01
  -5.98494987e-02 -1.70615906e-01  7.50983235e-02  2.88184731e-02
   1.22956525e-01  4.88106513e-01 -1.02743528e-01  6.40541261e-02
  -7.08648631e-01 -4.05657076e-02  7.25143756e-02 -7.76692877e-02
  -2.06454288e-01  1.07254650e-01]
 [-5.44435751e-02  5.69225190e-03 -8.09582382e-01 -1.00249998e-01
   7.15896240e-02  2.64070257e-01  1.38286501e-01 -1.67548045e-01
   1.18930975e-01  1.41242993e-01  1.78666953e-01  1.44115083e-01
  -1.07084167e-01 -3.88011973e-02  1.56146562e-02  8.23834457e-02
   2.41721993e-01 -2.10375855e-01]
 [-1.99167712e-02 -1.07253152e-01 -3.51953501e-01  2.08325582e-01
  -9.40555052e-02  1.38917415e-02  1.50699017e-01  8.49711068e-02
   2.90651249e-01 -2.29300611e-01 -3.23277333e-01  1.90733790e-01
   2.23865239e-01  7.12315577e-02  1.14294970e-02 -6.01627982e-02
  -3.93758296e-01  5.38451172e-01]
 [-7.32521593e-02 -2.55964898e-01  2.93518943e-02 -2.50132714e-01
   1.08946058e-01 -1.45722694e-02 -6.80903581e-02  5.94837223e-01
   1.46790658e-01  1.75011703e-01  5.92383575e-01  1.67363272e-01
   3.51548567e-02  4.21506329e-02  1.03502908e-02 -4.80025996e-02
  -4.86855181e-02  2.37148507e-01]
 [ 1.30823838e-01 -4.45969250e-02 -1.35416036e-01 -1.96060878e-01
   2.26416234e-02 -1.08847660e-01 -5.78008106e-02 -5.05778829e-01
  -3.56742934e-01  7.41526684e-02  3.83041554e-01 -2.78465041e-01
  -3.66067794e-02 -4.73078898e-02  2.25571219e-02 -1.49943128e-02
  -3.90363854e-01  3.72668509e-01]
 [ 7.43585024e-03 -7.50928189e-01  4.51394628e-02  1.12855367e-01
  -6.08666360e-03 -3.94089322e-02 -1.07470546e-02 -1.75868791e-01
   6.60006538e-02  5.04348525e-01 -2.04087945e-01 -1.03791764e-01
   2.50984763e-01 -3.28791832e-02 -8.03303380e-03  7.21357062e-03
   4.27362260e-02 -1.07129518e-01]
 [-1.21053092e-03 -9.01475626e-02  7.51428737e-03  3.27868468e-02
  -1.49208501e-03  1.41911816e-02  3.64028202e-01  1.17795536e-01
  -7.31990263e-01  6.68680965e-02 -9.70074675e-02  5.41807792e-01
   2.26114072e-02  1.50493548e-03 -1.27852591e-03  3.02405891e-03
  -1.90078443e-02  3.42573011e-03]
 [-1.08055845e-02 -1.06408939e-03  1.83806270e-03 -1.50575457e-02
   4.77873413e-03 -9.73395472e-03  7.87182496e-01  2.17540344e-01
  -5.40413885e-03 -1.79048821e-02  3.31883582e-02 -5.73083734e-01
   9.20296942e-03  1.50508920e-02 -1.12517611e-03 -1.32059291e-02
   4.67681765e-02 -1.98932548e-03]]

Percentage of variance explained by each eigen vector

In [58]:
print(pca.explained_variance_ratio_) #Percentage of variance explained
[5.29564314e-01 1.81750091e-01 6.54179746e-02 5.90659321e-02
 4.93084970e-02 4.26677236e-02 1.94399948e-02 1.47767700e-02
 1.10597521e-02 9.02424795e-03 5.33124727e-03 4.44368539e-03
 2.75073036e-03 2.00002223e-03 1.57649661e-03 1.13073820e-03
 5.21826683e-04 1.69956499e-04]

Elbow method

In [59]:
plt.bar(list(range(1,19)),pca.explained_variance_ratio_,alpha=0.5, align='center') #bar plot of Eigen Values vs Variation explained
plt.ylabel('Variation explained')#set y label
plt.xlabel('Eigen Value')# set x label
plt.show()
In [60]:
plt.step(list(range(1,19)),np.cumsum(pca.explained_variance_ratio_)) #step plot of Eigen Values vs Variation explained
plt.ylabel('Variation explained')#set y label
plt.xlabel('Eigen Value')# set x label
plt.show()
  • Around 95% variance is covered by 7 features,but I have chosen 8 features to cover more than 95% of variance of the data.

Dimensionality Reduction

In [61]:
pcad = PCA(n_components=8) #8 features
pcad.fit(Xscaled) #Fit into PCA 
print(pcad.components_) #Eigen vectors
print(pcad.explained_variance_ratio_) #Percentage of variance explained
Reduced_dimension = pcad.transform(Xscaled) #reduce dimensions to 8
[[ 2.73386365e-01  2.91677852e-01  3.04976365e-01  2.62684432e-01
   8.64666936e-02  1.56313715e-01  3.14630653e-01 -3.12637226e-01
   3.11592695e-01  2.81788574e-01  3.04115120e-01  3.11435896e-01
   2.69221778e-01  1.47893193e-02  2.08384441e-02  5.84361644e-02
   3.14603835e-02  7.75517866e-02]
 [ 1.01000418e-01 -1.24872126e-01  5.96014451e-02  1.93970977e-01
   2.92208467e-01  1.61633682e-01 -5.92407339e-02 -9.59263362e-04
  -7.17154884e-02 -1.14308322e-01 -5.81930156e-02 -6.60009356e-02
  -2.02756042e-01 -4.67185508e-01 -4.61531581e-03  1.03686372e-01
   5.07729078e-01  5.13927844e-01]
 [ 5.73312564e-02 -2.04435190e-01  7.96202949e-02 -5.60094911e-02
  -3.59662750e-01 -1.46991610e-01  1.15223261e-01 -6.04740839e-02
   1.24466774e-01 -1.96196144e-01  1.08848164e-01  1.21427784e-01
  -2.10477181e-01 -5.19601260e-02 -2.45023617e-01  7.66251988e-01
  -4.69567791e-02 -2.41646430e-02]
 [-1.87308525e-01 -1.78350606e-05  2.30886326e-02  1.35960162e-01
   3.41860046e-01  1.90441885e-01 -2.31606704e-02 -2.85261722e-02
  -3.80110322e-02  7.00887714e-03  1.73504900e-02 -4.52529374e-02
  -3.93150153e-02  1.73373732e-01 -8.58903801e-01 -4.93420174e-02
  -1.24972942e-01 -1.39693143e-02]
 [ 5.34911273e-02 -1.11354428e-01 -1.35875332e-01  2.51111825e-01
   2.84777304e-01 -7.84443718e-01  8.67925178e-02 -1.06719821e-01
   7.35386595e-02 -2.41050208e-01  1.78703215e-01  1.23723902e-01
  -1.57225088e-02  6.57715025e-02 -4.30878648e-02 -1.64767241e-01
   1.74152099e-01 -1.14826327e-01]
 [ 2.18691189e-01  3.98919492e-02 -3.28839016e-02 -1.98736484e-01
  -5.77514362e-01 -1.25321256e-01  5.36891269e-02 -1.36941513e-02
   6.84846621e-02  6.81943396e-02  7.73065287e-03  8.21720016e-02
   3.96852866e-03 -3.34620594e-01 -4.13877347e-01 -4.69583547e-01
   1.84135806e-01  6.21166504e-02]
 [ 1.91725643e-01 -4.11041320e-01  1.49526940e-01  1.51726135e-01
  -7.90921311e-02  3.74801577e-01  9.28452054e-02 -1.42032200e-01
   7.66548720e-02 -4.02085554e-01  1.90004552e-01  8.29569322e-02
  -4.03546076e-01  1.93444518e-01  1.24768909e-01 -3.42159598e-01
  -8.43335696e-02 -1.51085524e-01]
 [ 5.16772622e-01  4.67981008e-02 -8.85658867e-02 -7.95559923e-02
  -4.89147400e-02 -5.28393921e-02 -8.81278889e-02  8.96397585e-02
  -6.59898603e-02  1.82066267e-01 -1.59916812e-01 -7.19156792e-02
  -1.44855109e-01  6.74053597e-01 -6.21447954e-02  6.74561664e-02
   2.92418612e-01  2.41031557e-01]]
[0.52956431 0.18175009 0.06541797 0.05906593 0.0493085  0.04266772
 0.01943999 0.01477677]

Pairplot after PCA

In [62]:
sns.pairplot(pd.DataFrame(Reduced_dimension)) #pairplot of principle components
Out[62]:
<seaborn.axisgrid.PairGrid at 0x23c1f1663c8>

Understanding from pairplot after PCA

  • All features cointain more information and lesser noise.
  • The scatter spread of the dataframe after pca depicts that the PCA has covered large amount of variance.

Train Test Split After PCA(70:30)

In [63]:
X_train1,X_test1,y_train1,y_test1 = train_test_split(Reduced_dimension,y,test_size=0.30,random_state=1) #train test split in 70:30 ratio

Models:

Naive Bayes

In [64]:
NB = GaussianNB() #Instantiate the Gaussian Naive bayes 
NB.fit(X_train1,y_train1) #Call the fit method of NB to train the model or to learn the parameters of model
y_predi = NB.predict(X_test1) #Predict 
In [65]:
print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test1,y_predi)) #for confusion matrix
print('-'*30)
NB_accuracyWithpca = accuracy_score(y_test1,y_predi)
print('Accuracy of Naive Bayes :{:.2f}'.format(NB_accuracyWithpca)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test1,y_predi)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[ 34  17   8]
 [ 10 116   7]
 [  3  11  48]]
------------------------------
Accuracy of Naive Bayes :0.78
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.72      0.58      0.64        59
           1       0.81      0.87      0.84       133
           2       0.76      0.77      0.77        62

    accuracy                           0.78       254
   macro avg       0.76      0.74      0.75       254
weighted avg       0.78      0.78      0.78       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->

Naive bayes with cross validation

In [66]:
scores = cross_val_score(NB, Xscaled, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_NB_cross=scores.max()#selecting highest score
print(scores)#print Scores
[0.6        0.52941176 0.61176471 0.6        0.56470588 0.63529412
 0.61176471 0.54117647 0.6547619  0.59756098]

SVC with C=0.01,0.05,0.5,1 and kernel=rbf

In [67]:
svc1 = SVC(1,kernel ='rbf',gamma='auto')  #Instantiate SVC
svc1.fit(X_train1,y_train1) #Call the fit method of SVC to train the model or to learn the parameters of model
predicted_svc = svc1.predict(X_test1) #Predict 

print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test1,predicted_svc)) #for confusion matrix
print('-'*30)
SVC_accuracyWithpca1 = accuracy_score(y_test1,predicted_svc) #for accuracy score
print('Accuracy of SVC :',SVC_accuracyWithpca1)
print('-'*30)
print('\n Classification Report\n',classification_report(y_test1,predicted_svc)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[ 58   1   0]
 [  3 123   7]
 [  4   3  55]]
------------------------------
Accuracy of SVC : 0.9291338582677166
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.89      0.98      0.94        59
           1       0.97      0.92      0.95       133
           2       0.89      0.89      0.89        62

    accuracy                           0.93       254
   macro avg       0.92      0.93      0.92       254
weighted avg       0.93      0.93      0.93       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
In [68]:
svc = SVC(0.01,kernel ='rbf',gamma='auto')  #Instantiate SVC
svc.fit(X_train1,y_train1) #Call the fit method of SVC to train the model or to learn the parameters of model
predicted_svc = svc.predict(X_test1) #Predict 

print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test1,predicted_svc)) #for confusion matrix
print('-'*30)
SVC_accuracyWithpca = accuracy_score(y_test1,predicted_svc) #for accuracy score
print('Accuracy of SVC :',SVC_accuracyWithpca)
print('-'*30)
print('\n Classification Report\n',classification_report(y_test1,predicted_svc)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[  0  59   0]
 [  0 133   0]
 [  0  62   0]]
------------------------------
Accuracy of SVC : 0.5236220472440944
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.00      0.00      0.00        59
           1       0.52      1.00      0.69       133
           2       0.00      0.00      0.00        62

    accuracy                           0.52       254
   macro avg       0.17      0.33      0.23       254
weighted avg       0.27      0.52      0.36       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
C:\Users\Ajay\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
In [69]:
svc = SVC(0.05,kernel ='rbf',gamma='auto')  #Instantiate SVC
svc.fit(X_train1,y_train1) #Call the fit method of SVC to train the model or to learn the parameters of model
predicted_svc = svc.predict(X_test1) #Predict 

print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test1,predicted_svc)) #for confusion matrix
print('-'*30)
SVC_accuracyWithpca = accuracy_score(y_test1,predicted_svc) #for accuracy score
print('Accuracy of SVC :',SVC_accuracyWithpca)
print('-'*30)
print('\n Classification Report\n',classification_report(y_test1,predicted_svc)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[ 27  32   0]
 [  4 129   0]
 [  2  44  16]]
------------------------------
Accuracy of SVC : 0.6771653543307087
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.82      0.46      0.59        59
           1       0.63      0.97      0.76       133
           2       1.00      0.26      0.41        62

    accuracy                           0.68       254
   macro avg       0.82      0.56      0.59       254
weighted avg       0.76      0.68      0.64       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
In [70]:
svc = SVC(0.5,kernel ='rbf',gamma='auto')  #Instantiate SVC
svc.fit(X_train1,y_train1) #Call the fit method of SVC to train the model or to learn the parameters of model
predicted_svc = svc.predict(X_test1) #Predict 

print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test1,predicted_svc)) #for confusion matrix
print('-'*30)
SVC_accuracyWithpca = accuracy_score(y_test1,predicted_svc) #for accuracy score
print('Accuracy of SVC :',SVC_accuracyWithpca)
print('-'*30)
print('\n Classification Report\n',classification_report(y_test1,predicted_svc)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[ 58   0   1]
 [  3 125   5]
 [  3   5  54]]
------------------------------
Accuracy of SVC : 0.9330708661417323
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.91      0.98      0.94        59
           1       0.96      0.94      0.95       133
           2       0.90      0.87      0.89        62

    accuracy                           0.93       254
   macro avg       0.92      0.93      0.93       254
weighted avg       0.93      0.93      0.93       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->

SVC with cross Validation

In [71]:
scores = cross_val_score(svc, Xscaled, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_svc_cross=scores.max()#selecting highest score
print(scores)#print Scores
[0.94117647 0.96470588 0.92941176 0.95294118 0.96470588 0.95294118
 0.94117647 0.95294118 0.95238095 0.92682927]

SVC with C=0.01,0.05,0.5,1 and kernel=Linear

In [72]:
svc = SVC(0.01,kernel = 'linear')  #Instantiate SVC
svc.fit(X_train1,y_train1) #Call the fit method of SVC to train the model or to learn the parameters of model
predicted_svc = svc.predict(X_test1) #Predict 

print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test1,predicted_svc)) #for confusion matrix
print('-'*30)
SVC_accuracyWithpca = accuracy_score(y_test1,predicted_svc) #for accuracy score
print('Accuracy of SVC :',SVC_accuracyWithpca)
print('-'*30)
print('\n Classification Report\n',classification_report(y_test1,predicted_svc)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[ 48  10   1]
 [ 17 108   8]
 [  3   6  53]]
------------------------------
Accuracy of SVC : 0.8228346456692913
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.71      0.81      0.76        59
           1       0.87      0.81      0.84       133
           2       0.85      0.85      0.85        62

    accuracy                           0.82       254
   macro avg       0.81      0.83      0.82       254
weighted avg       0.83      0.82      0.82       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
In [73]:
svc = SVC(0.05,kernel = 'linear')  #Instantiate SVC
svc.fit(X_train1,y_train1) #Call the fit method of SVC to train the model or to learn the parameters of model
predicted_svc = svc.predict(X_test1) #Predict 

print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test1,predicted_svc)) #for confusion matrix
print('-'*30)
SVC_accuracyWithpca = accuracy_score(y_test1,predicted_svc) #for accuracy score
print('Accuracy of SVC :',SVC_accuracyWithpca)
print('-'*30)
print('\n Classification Report\n',classification_report(y_test1,predicted_svc)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[ 43  14   2]
 [ 16 109   8]
 [  2   7  53]]
------------------------------
Accuracy of SVC : 0.8070866141732284
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.70      0.73      0.72        59
           1       0.84      0.82      0.83       133
           2       0.84      0.85      0.85        62

    accuracy                           0.81       254
   macro avg       0.79      0.80      0.80       254
weighted avg       0.81      0.81      0.81       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
In [74]:
svc = SVC(0.5,kernel = 'linear')  #Instantiate SVC
svc.fit(X_train1,y_train1) #Call the fit method of SVC to train the model or to learn the parameters of model
predicted_svc = svc.predict(X_test1) #Predict 

print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test1,predicted_svc)) #for confusion matrix
print('-'*30)
SVC_accuracyWithpca = accuracy_score(y_test1,predicted_svc) #for accuracy score
print('Accuracy of SVC :',SVC_accuracyWithpca)
print('-'*30)
print('\n Classification Report\n',classification_report(y_test1,predicted_svc)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[ 48   9   2]
 [ 16 111   6]
 [  0   8  54]]
------------------------------
Accuracy of SVC : 0.8385826771653543
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.75      0.81      0.78        59
           1       0.87      0.83      0.85       133
           2       0.87      0.87      0.87        62

    accuracy                           0.84       254
   macro avg       0.83      0.84      0.83       254
weighted avg       0.84      0.84      0.84       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
In [75]:
svc = SVC(1,kernel = 'linear')  #Instantiate SVC
svc.fit(X_train1,y_train1) #Call the fit method of SVC to train the model or to learn the parameters of model
predicted_svc = svc.predict(X_test1) #Predict 

print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test1,predicted_svc)) #for confusion matrix
print('-'*30)
SVC_accuracyWithpca = accuracy_score(y_test1,predicted_svc) #for accuracy score
print('Accuracy of SVC :',SVC_accuracyWithpca)
print('-'*30)
print('\n Classification Report\n',classification_report(y_test1,predicted_svc)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[ 48   9   2]
 [ 16 111   6]
 [  0   8  54]]
------------------------------
Accuracy of SVC : 0.8385826771653543
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.75      0.81      0.78        59
           1       0.87      0.83      0.85       133
           2       0.87      0.87      0.87        62

    accuracy                           0.84       254
   macro avg       0.83      0.84      0.83       254
weighted avg       0.84      0.84      0.84       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->

SVC with cross Validation

In [76]:
scores = cross_val_score(svc, Xscaled, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_svc_cross=scores.max()#selecting highest score
print(scores)#print Scores
[0.94117647 0.88235294 0.88235294 0.90588235 0.96470588 0.89411765
 0.89411765 0.91764706 0.96428571 0.90243902]

KNN

In [77]:
knn = KNeighborsClassifier(n_neighbors = 3)  #Instantiate KNN with k=3
knn.fit(X_train1,y_train1) #Call the fit method of KNN to train the model or to learn the parameters of model
y_predict = knn.predict(X_test1) #Predict 

print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test1,y_predi)) #for confusion matrix
print('-'*30)
KNN_accuracyWithpca = accuracy_score(y_test1,y_predict)
print('Accuracy of KNN :{:.2f}'.format(KNN_accuracyWithpca)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test1,y_predi)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[ 34  17   8]
 [ 10 116   7]
 [  3  11  48]]
------------------------------
Accuracy of KNN :0.90
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.72      0.58      0.64        59
           1       0.81      0.87      0.84       133
           2       0.76      0.77      0.77        62

    accuracy                           0.78       254
   macro avg       0.76      0.74      0.75       254
weighted avg       0.78      0.78      0.78       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->

KNN with cross validation

In [78]:
scores = cross_val_score(knn, Xscaled, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_knn_cross=scores.max() #selecting highest score
print(scores)#print Scores
[0.90588235 0.91764706 0.88235294 0.91764706 0.96470588 0.96470588
 0.91764706 0.91764706 0.91666667 0.8902439 ]

Decision Tree

In [79]:
dTR = DecisionTreeClassifier(criterion = 'gini', max_depth = 3, random_state=1) #Instantiate Decision Tree with max_depth
dTR.fit(X_train1, y_train1) #Call the fit method of DT to train the model or to learn the parameters of model
predicted_DTR = dTR.predict(X_test1) #Predict

print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test1,predicted_DTR)) #for confusion matrix
print('-'*30)
DTR_accuracyWithpca = accuracy_score(y_test1,predicted_DTR)
print('Accuracy of Decision Tree with Regularization:{:.2f}'.format(DTR_accuracyWithpca)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test1,predicted_DTR)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[ 31  21   7]
 [  3 107  23]
 [  1  11  50]]
------------------------------
Accuracy of Decision Tree with Regularization:0.74
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.89      0.53      0.66        59
           1       0.77      0.80      0.79       133
           2       0.62      0.81      0.70        62

    accuracy                           0.74       254
   macro avg       0.76      0.71      0.72       254
weighted avg       0.76      0.74      0.74       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->

Decision tree with cross validation

In [80]:
scores = cross_val_score(dTR, Xscaled, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_DTR_cross=scores.max()#selecting highest score
print(scores)#print Scores
[0.8        0.70588235 0.78823529 0.81176471 0.81176471 0.84705882
 0.81176471 0.74117647 0.85714286 0.81707317]

Bagging

In [81]:
bagg = BaggingClassifier(base_estimator=dTR, n_estimators=500,random_state=1) #Instantiate Bagging Classifier
bagg = bagg.fit(X_train1, y_train1) #Call the fit method of Bagging classifier to train the model or to learn the parameters of model
predicted_BAG = bagg.predict(X_test1) #Predict


print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test1,predicted_BAG)) #for confusion matrix
print('-'*30)
BAG_accuracyWithpca = accuracy_score(y_test1,predicted_BAG)
print('Accuracy of Decision Tree :{:.2f}'.format(BAG_accuracyWithpca)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test1,predicted_BAG)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[ 38  16   5]
 [  5 117  11]
 [  3  13  46]]
------------------------------
Accuracy of Decision Tree :0.79
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.83      0.64      0.72        59
           1       0.80      0.88      0.84       133
           2       0.74      0.74      0.74        62

    accuracy                           0.79       254
   macro avg       0.79      0.76      0.77       254
weighted avg       0.79      0.79      0.79       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->

Bagging with cross validation

In [82]:
scores = cross_val_score(dTR, Xscaled, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_bagg_cross=scores.max()#selecting highest score
print(scores)#print Scores
[0.8        0.70588235 0.78823529 0.81176471 0.81176471 0.84705882
 0.81176471 0.74117647 0.85714286 0.81707317]

Adaptive Boosting

In [83]:
Aboost = AdaBoostClassifier(n_estimators=50, random_state=1) #Instantiate Adaptive boosting Classifier
Aboost = Aboost.fit(X_train1, y_train1) #Call the fit method of Adaptive boosting Classifier to train the model or to learn the parameters of model
predicted_ADA = Aboost.predict(X_test1) #Predict

print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test1,predicted_ADA)) #for confusion matrix
print('-'*30)
ADA_accuracyWithpca = accuracy_score(y_test1,predicted_ADA)
print('Accuracy of KNN :{:.2f}'.format(ADA_accuracyWithpca)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test1,predicted_ADA)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[ 46  13   0]
 [  9 119   5]
 [  3  12  47]]
------------------------------
Accuracy of KNN :0.83
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.79      0.78      0.79        59
           1       0.83      0.89      0.86       133
           2       0.90      0.76      0.82        62

    accuracy                           0.83       254
   macro avg       0.84      0.81      0.82       254
weighted avg       0.84      0.83      0.83       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->

Adaptive boosting with cross validation

In [84]:
scores = cross_val_score(Aboost, Xscaled, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_Aboost_cross=scores.max() #selecting highest score
print(scores)#print Scores
[0.83529412 0.84705882 0.78823529 0.84705882 0.85882353 0.84705882
 0.75294118 0.74117647 0.85714286 0.86585366]

Gradient Boosting

In [85]:
Gboost = GradientBoostingClassifier(n_estimators = 100,random_state=1) #Instantiate Gradient boosting Classifier
Gboost = Gboost.fit(X_train1, y_train1)#Call the fit method of Gradient boosting Classifier to train the model or to learn the parameters of model
predicted_GRAD = Gboost.predict(X_test1) #Predict


print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test1,predicted_GRAD)) #for confusion matrix
print('-'*30)
GRAD_accuracyWithpca = accuracy_score(y_test1,predicted_GRAD)
print('Accuracy of KNN :{:.2f}'.format(GRAD_accuracyWithpca)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test1,predicted_GRAD)) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[ 56   1   2]
 [  5 124   4]
 [  1  10  51]]
------------------------------
Accuracy of KNN :0.91
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.90      0.95      0.93        59
           1       0.92      0.93      0.93       133
           2       0.89      0.82      0.86        62

    accuracy                           0.91       254
   macro avg       0.91      0.90      0.90       254
weighted avg       0.91      0.91      0.91       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->

Gradient Boosting with cross validation

In [86]:
scores = cross_val_score(Gboost, Xscaled, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_Gboost_cross=scores.max()#selecting highest score
print(scores)#print Scores
[0.95294118 0.91764706 0.95294118 0.97647059 0.98823529 0.94117647
 0.94117647 0.96470588 0.96428571 0.98780488]

Random Forest

In [87]:
#n=100
Rforest = RandomForestClassifier(n_estimators = 100, random_state=1, max_features=3)#Instantiate Random Forest Classifier
Rforest = Rforest.fit(X_train1, y_train1) #Call the fit method of Random Forest Classifier to train the model or to learn the parameters of model
predicted_RAN = Rforest.predict(X_test1) #Predict

print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test1,predicted_RAN )) #for confusion matrix
print('-'*30)
RAN_accuracyWithpca = accuracy_score(y_test1,predicted_RAN )
print('Accuracy of KNN :{:.2f}'.format(RAN_accuracyWithpca)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test1,predicted_RAN )) #for classification report
print('->'*63)
->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->
Confusion Matrix
 [[ 56   0   3]
 [  6 122   5]
 [  2  10  50]]
------------------------------
Accuracy of KNN :0.90
------------------------------

 Classification Report
               precision    recall  f1-score   support

           0       0.88      0.95      0.91        59
           1       0.92      0.92      0.92       133
           2       0.86      0.81      0.83        62

    accuracy                           0.90       254
   macro avg       0.89      0.89      0.89       254
weighted avg       0.90      0.90      0.90       254

->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->->

Random Forest with cross validation

In [88]:
scores = cross_val_score(Rforest, Xscaled, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_Rforest_cross=scores.max() #selecting highest score
print(scores)#print Scores
[0.95294118 0.92941176 0.97647059 0.97647059 0.96470588 0.95294118
 0.95294118 0.98823529 0.96428571 0.97560976]

Models with accuracy scores

  • All Accuracy scores are calculated on Test data
In [89]:
Scores = [('Naive bayes', NB_accuracy,NB_accuracyWithpca,max_NB_cross_nopca,max_NB_cross),
      ('KNN', KNN_accuracy,KNN_accuracyWithpca,max_knn_cross_nopca,max_knn_cross),
      ('SVC', SVC_accuracy1,SVC_accuracyWithpca1,max_svc_cross_nopca,max_svc_cross),
      ('Decision Tree with Regularization',DTR_accuracy,DTR_accuracyWithpca,max_DTR_cross_nopca,max_DTR_cross),
      ('Bagging',BAG_accuracy,BAG_accuracyWithpca,max_bagg_cross_nopca,max_bagg_cross),
      ('Adaptive Boosting',ADA_accuracy,ADA_accuracyWithpca,max_Aboost_cross_nopca,max_Aboost_cross),
      ('Gradient Boosting',GRAD_accuracy,GRAD_accuracyWithpca,max_Gboost_cross_nopca,max_Gboost_cross),
      ('Random Forest N=100',RAN_accuracy,RAN_accuracyWithpca,max_Rforest_cross_nopca,max_Rforest_cross)] #List of accuracy scores of all models

Scores = pd.DataFrame(Scores,columns=['Model','Accuracy score without PCA and reduced dimensions','Accuracy score with PCA and reduced dimensions','Maximum Accuracy with cross validation without PCA and reduced dimensions','Maximum Accuracy with cross validation and PCA and reduced dimensions']) #Conversion of list to dataframe
Sorted=Scores.sort_values(by='Accuracy score with PCA and reduced dimensions',ascending=True) #Sort values in descending manner
Sorted
Out[89]:
Model Accuracy score without PCA and reduced dimensions Accuracy score with PCA and reduced dimensions Maximum Accuracy with cross validation without PCA and reduced dimensions Maximum Accuracy with cross validation and PCA and reduced dimensions
3 Decision Tree with Regularization 0.787402 0.740157 0.857143 0.857143
0 Naive bayes 0.716535 0.779528 0.819149 0.654762
4 Bagging 0.807087 0.791339 0.857143 0.857143
5 Adaptive Boosting 0.771654 0.834646 0.804878 0.865854
1 KNN 0.881890 0.897638 0.917647 0.964706
7 Random Forest N=100 0.933071 0.897638 0.964706 0.988235
6 Gradient Boosting 0.921260 0.909449 1.000000 0.988235
2 SVC 0.921260 0.929134 0.858824 0.964706

Comparison of accuracy scores on test data without cross validation

In [90]:
ax = Sorted.plot(x='Model', y='Accuracy score without PCA and reduced dimensions', legend=False,rot=90)
ax2 = ax.twinx()
Sorted.plot(x='Model', y='Accuracy score with PCA and reduced dimensions', ax=ax2, legend=False, color="r")
ax.figure.legend()
ax.legend(bbox_to_anchor=(1.1,0), bbox_transform=fig.transFigure)
plt.show()

Comparison of accuracy scores on test data with cross validation

In [91]:
ax = Sorted.plot(x='Model', y='Maximum Accuracy with cross validation without PCA and reduced dimensions', legend=False,rot=90)
ax2 = ax.twinx()
Sorted.plot(x='Model', y='Maximum Accuracy with cross validation and PCA and reduced dimensions', ax=ax2, legend=False, color="r")
ax.figure.legend()
ax.legend(bbox_to_anchor=(1.4,0), bbox_transform=fig.transFigure)
plt.show()

Conclusion:

Comments on Dataset:

  • The dataset had multicollinearity between features.
  • The dataset had missing values in multiple features.
  • After imputing outliers with mean, some were created on the lower side,but the number was less.
  • The dataset is labelled i.e. a target column is present in the dataset, so it can be solved using supervised machine learning models.

Comments on Models:

  • For SVC in terms of accuracy scores,best hyper parameters are C=1 and kernel='rbf'.
  • For SVC with PCA and reduced dimensions in terms of Precison and Recall, besy hyper parameters are C=0.5 and kernel ='rbf'.
  • For SVC without PCA and reduced dimensions in terms of Precison and Recall, besy hyper parameters are C=1 and kernel ='rbf'.

  • In cross validation, the accuracy score of all models has increased significantly. On one instance Gradient boosting has accuracy of 100%.

  • After PCA and dimensionality reduction around 97% variance has been covered in 8 features as per elbow method.
  • Naive bayes classifier produced lowest accuracy of 65% with cross validation and reduced dimensions.

Miscellaneous Comments:

  • Due to three pairplots,2 boxplot plotting loops and corr plot this project takes a bit more time to run.
In [ ]: